discriminative ability
- Asia > Middle East > Jordan (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > India (0.04)
- Research Report > Experimental Study (1.00)
- Overview (1.00)
- Research Report > New Finding (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.95)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.69)
LPT++: Efficient Training on Mixture of Long-tailed Experts
Dong, Bowen, Zhou, Pan, Zuo, Wangmeng
We introduce LPT++, a comprehensive framework for long-tailed classification that combines parameter-efficient fine-tuning (PEFT) with a learnable model ensemble. LPT++ enhances frozen Vision Transformers (ViTs) by integrating three core components. The first is a universal long-tailed adaptation module, which aggregates long-tailed prompts and visual adapters to adapt the pretrained model to the target domain while improving its discriminative ability. The second is a mixture of long-tailed experts framework with a mixture-of-experts (MoE) scorer, which adaptively computes reweighting coefficients for the confidence scores of visual-only and visual-language (VL) model experts to generate more accurate predictions. Finally, LPT++ employs a three-phase training framework in which each critical module is learned separately, yielding a stable and effective long-tailed training paradigm. In addition, we propose a simplified version of LPT++, namely LPT, which integrates only a visual-only pretrained ViT and long-tailed prompts to form a single-model method; LPT clearly illustrates how long-tailed prompts work while achieving comparable performance without VL pretrained models. Experiments show that, with only 1% extra trainable parameters, LPT++ achieves accuracy comparable to all counterparts.
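To make the MoE scorer idea concrete, here is a minimal PyTorch sketch: a small gating network produces per-sample coefficients that fuse the confidence scores of the two experts. All names (MoEScorer, feat_dim, the gate architecture) are illustrative assumptions, not LPT++'s actual implementation.

```python
# A minimal sketch of the mixture-of-experts scoring idea, assuming two
# experts that each emit class-confidence scores. Names are hypothetical.
import torch
import torch.nn as nn

class MoEScorer(nn.Module):
    """Predicts per-sample reweighting coefficients for two experts."""
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),  # one weight per expert
        )

    def forward(self, feats, probs_visual, probs_vl):
        # Coefficients sum to 1 per sample via softmax.
        w = self.gate(feats).softmax(dim=-1)  # (B, 2)
        return w[:, :1] * probs_visual + w[:, 1:] * probs_vl

# Toy usage: fuse softmax outputs of two frozen experts.
B, C, D = 4, 10, 768
scorer = MoEScorer(feat_dim=D)
fused = scorer(torch.randn(B, D),
               torch.randn(B, C).softmax(-1),
               torch.randn(B, C).softmax(-1))
print(fused.shape)  # torch.Size([4, 10])
```

Given the abstract's three-phase scheme, a gate like this would presumably be trained in a later phase with both experts frozen, though the paper's exact schedule is not shown here.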
- Asia > China > Heilongjiang Province > Harbin (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Asia > Singapore (0.04)
- Asia > China > Hong Kong (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.67)
Unified Text-to-Image Generation and Retrieval
Qu, Leigang, Li, Haochuan, Wang, Tan, Wang, Wenjie, Li, Yongqi, Nie, Liqiang, Chua, Tat-Seng
How humans can efficiently and effectively acquire images is a perennial question. A typical solution is text-to-image retrieval from an existing database given the text query; however, the limited database typically lacks creativity. By contrast, recent breakthroughs in text-to-image generation have made it possible to produce fancy and diverse visual content, but generation still struggles to synthesize knowledge-intensive images. In this work, we rethink the relationship between text-to-image generation and retrieval and propose a unified framework in the context of Multimodal Large Language Models (MLLMs). Specifically, we first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method that performs retrieval in a training-free manner. Subsequently, we unify generation and retrieval in an autoregressive generation manner and propose an autonomous decision module that chooses the better match between the generated and retrieved images as the response to the text query. Additionally, we construct a benchmark called TIGeR-Bench, covering creative and knowledge-intensive domains, to standardize the evaluation of unified text-to-image generation and retrieval. Extensive experimental results on TIGeR-Bench and two retrieval benchmarks, i.e., Flickr30K and MS-COCO, demonstrate the superiority and effectiveness of our proposed method.
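As a rough illustration of the autonomous decision step, the sketch below scores each candidate image against the text query and returns the better match. Here score_fn is a stand-in for whatever text-image alignment scorer is used (the paper leverages the MLLM's own discriminative abilities); all names are hypothetical.

```python
# A minimal sketch of the decision step: given generated and retrieved
# candidates, score each against the query and return the best match.
from typing import Callable, Sequence

def decide(query: str,
           candidates: Sequence[object],
           score_fn: Callable[[str, object], float]) -> object:
    """Return the candidate image best matching the text query."""
    scores = [score_fn(query, img) for img in candidates]
    return candidates[max(range(len(scores)), key=scores.__getitem__)]

# Toy usage with a dummy scorer that prefers the retrieved candidate.
best = decide("a red bicycle",
              ["generated.png", "retrieved.png"],
              score_fn=lambda q, img: 2.0 if "retrieved" in img else 1.0)
print(best)  # retrieved.png
```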
- Europe > Switzerland > Zürich > Zürich (0.14)
- North America > United States > Florida > Orange County > Orlando (0.04)
- Europe > Sweden (0.04)
- (6 more...)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Discriminative Probing and Tuning for Text-to-Image Generation
Qu, Leigang, Wang, Wenjie, Li, Yongqi, Zhang, Hanwang, Nie, Liqiang, Chua, Tat-Seng
Despite advancements in text-to-image generation (T2I), prior methods often face text-image misalignment problems such as relation confusion in generated images. Existing solutions involve cross-attention manipulation for better compositional understanding or integrating large language models for improved layout planning. However, the inherent alignment capabilities of T2I models are still inadequate. By reviewing the link between generative and discriminative modeling, we posit that T2I models' discriminative abilities may reflect their text-image alignment proficiency during generation. In this light, we advocate bolstering the discriminative abilities of T2I models to achieve more precise text-to-image alignment for generation. We present a discriminative adapter built on T2I models to probe their discriminative abilities on two representative tasks and leverage discriminative fine-tuning to improve their text-image alignment. As a bonus of the discriminative adapter, a self-correction mechanism can leverage discriminative gradients to better align generated images to text prompts during inference. Comprehensive evaluations across three benchmark datasets, including both in-distribution and out-of-distribution scenarios, demonstrate our method's superior generation performance. Meanwhile, it achieves state-of-the-art discriminative performance on the two discriminative tasks compared to other generative models.
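The self-correction mechanism can be pictured as gradient ascent on an alignment score during inference. The sketch below assumes a differentiable decode and align_score, which are hypothetical stand-ins for the T2I decoder and the discriminative adapter; it illustrates the general idea only, not the paper's code.

```python
# A minimal sketch of gradient-based self-correction at inference, assuming
# a differentiable alignment score between a decoded image and the prompt.
import torch

def self_correct(latent, text_emb, decode, align_score,
                 steps: int = 3, lr: float = 0.1):
    latent = latent.clone().requires_grad_(True)
    for _ in range(steps):
        score = align_score(decode(latent), text_emb)  # higher = better aligned
        grad, = torch.autograd.grad(score, latent)
        latent = (latent + lr * grad).detach().requires_grad_(True)  # ascend
    return latent.detach()

# Toy usage with dummy differentiable stand-ins.
lat = torch.randn(1, 4, 8, 8)
txt = torch.randn(1, 16)
out = self_correct(lat, txt,
                   decode=lambda z: z.flatten(1),
                   align_score=lambda img, t: -(img.mean(1) - t.mean(1)).pow(2).sum())
print(out.shape)  # torch.Size([1, 4, 8, 8])
```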
- Europe > Switzerland > Zürich > Zürich (0.14)
- Asia > Singapore (0.04)
- Europe > Netherlands > North Holland > Amsterdam (0.04)
- (5 more...)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)
Enhancing Multi-modal Cooperation via Fine-grained Modality Valuation
Wei, Yake, Feng, Ruoxuan, Wang, Zihe, Hu, Di
One primary topic of multi-modal learning is to jointly incorporate heterogeneous information from different modalities. However, most models suffer from unsatisfactory multi-modal cooperation and cannot jointly utilize all modalities well. Some methods have been proposed to identify and enhance the worse-learnt modality, but they often fail to provide a fine-grained, theoretically supported observation of multi-modal cooperation at the sample level. Hence, it is essential to reasonably observe and improve the fine-grained cooperation between modalities, especially in realistic scenarios where the modality discrepancy can vary across samples. To this end, we introduce a fine-grained modality valuation metric to evaluate the contribution of each modality at the sample level. Via modality valuation, we observe that multi-modal models tend to rely on one specific modality, leaving the other modalities low-contributing. We further analyze this issue and improve cooperation between modalities by enhancing the discriminative ability of low-contributing modalities in a targeted manner. Overall, our method reasonably observes the fine-grained uni-modal contribution at the sample level and achieves considerable improvement on different multi-modal models.
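A sample-level modality valuation can be illustrated with a Shapley-style attribution, where each modality's contribution is its average marginal gain over modality subsets. The sketch below assumes a scoring function v(S) that evaluates the model on one sample using only the modality subset S; the paper defines its own metric, so this is a generic illustration only.

```python
# A minimal Shapley-style sketch of per-sample modality valuation.
from itertools import combinations
from math import factorial

def shapley_contributions(modalities, v):
    """Per-modality Shapley value of v over subsets of `modalities`."""
    n = len(modalities)
    phi = {}
    for m in modalities:
        rest = [x for x in modalities if x != m]
        total = 0.0
        for k in range(n):
            for S in combinations(rest, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (v(set(S) | {m}) - v(set(S)))
        phi[m] = total
    return phi

# Toy usage: audio dominates, video contributes little on this sample.
score = {frozenset(): 0.0, frozenset({"audio"}): 0.8,
         frozenset({"video"}): 0.1, frozenset({"audio", "video"}): 0.85}
print(shapley_contributions(["audio", "video"],
                            lambda S: score[frozenset(S)]))
# {'audio': 0.775, 'video': 0.075} -- audio is the dominant modality
```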
- Asia > China > Beijing > Beijing (0.04)
- Europe > North Macedonia > Skopje Statistical Region > Skopje Municipality > Skopje (0.04)
Incorporating Domain Knowledge Graph into Multimodal Movie Genre Classification with Self-Supervised Attention and Contrastive Learning
Li, Jiaqi, Qi, Guilin, Zhang, Chuanyi, Chen, Yongrui, Tan, Yiming, Xia, Chenlong, Tian, Ye
Multimodal movie genre classification has always been regarded as a demanding multi-label classification task due to the diversity of multimodal data such as posters, plot summaries, trailers, and metadata. Although existing works have made great progress in modeling and combining each modality, they still face three issues: 1) unutilized group relations in metadata, 2) unreliable attention allocation, and 3) indiscriminative fused features. Given that knowledge graphs have been proven to contain rich information, we present a novel framework that exploits a knowledge graph from multiple perspectives to address the above problems. As preparation, the metadata is processed into a domain knowledge graph, and a translation-based knowledge graph embedding model is adopted to capture the relations between entities. First, we retrieve the relevant embeddings from the knowledge graph by utilizing group relations in the metadata and integrate them with the other modalities. Next, we introduce an Attention Teacher module for reliable attention allocation based on self-supervised learning; it learns the distribution of the knowledge graph and produces rational attention weights. Finally, a Genre-Centroid Anchored Contrastive Learning module is proposed to strengthen the discriminative ability of fused features, with the anchor embedding space initialized from the genre entities in the knowledge graph. To verify the effectiveness of our framework, we collect MM-IMDb 2.0, a larger and more challenging dataset than MM-IMDb. Experimental results on both datasets demonstrate that our model is superior to state-of-the-art methods. We will release the code in the near future.
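A minimal sketch of the centroid-anchored contrastive idea appears below: fused features are pulled toward the centroids of their positive genres and pushed away from the rest, with anchors assumed to be initialized from knowledge-graph genre embeddings. The exact loss form and all names here are illustrative, not the paper's implementation.

```python
# A minimal sketch of centroid-anchored contrastive learning for multi-label
# genre classification. Anchors stand in for genre-entity embeddings.
import torch
import torch.nn.functional as F

def centroid_contrastive_loss(feats, anchors, labels, tau: float = 0.1):
    """feats: (B, D) fused features; anchors: (G, D) genre centroids;
    labels: (B, G) multi-hot genre labels."""
    feats = F.normalize(feats, dim=-1)
    anchors = F.normalize(anchors, dim=-1)
    logits = feats @ anchors.t() / tau          # (B, G) similarities
    log_prob = logits.log_softmax(dim=-1)
    # Average log-likelihood over each sample's positive genres.
    pos = (log_prob * labels).sum(-1) / labels.sum(-1).clamp(min=1)
    return -pos.mean()

# Toy usage: 4 samples, 6 genres, 32-d fused features.
loss = centroid_contrastive_loss(torch.randn(4, 32),
                                 torch.randn(6, 32),
                                 torch.randint(0, 2, (4, 6)).float())
print(loss.item())
```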
- North America > Canada > Ontario > National Capital Region > Ottawa (0.05)
- Asia > China > Jiangsu Province > Nanjing (0.05)
- Asia > Middle East > Republic of Türkiye > Batman Province > Batman (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Media > Film (0.94)
- Leisure & Entertainment (0.68)
On (assessing) the fairness of risk score models
Petersen, Eike, Ganz, Melanie, Holm, Sune Hannibal, Feragen, Aasa
To date, much of the algorithmic fairness literature has focused on the fairness of classification systems, which are used, for example, to decide whether a person should be granted a loan or be released from prison on bail. Even where such classification decisions are based on risk score models - as in the highly influential COMPAS case [5, 11, 16] - their fairness is typically considered a function of the decisions, or classifications, made by the system. Of course, any risk score model can be turned into a classifier by selecting a probability threshold (in binary classification) or predicting the most likely outcome (in multi-class classification). Nevertheless, we argue here that it is worthwhile to distinguish between these two settings and to consider the fairness of risk models independently of their downstream use, be it as the basis for a classifier or otherwise. We discuss notions of fairness for risk scores and their relationship to classical, classification-level notions of fairness, and we develop robust tools to empirically quantify risk score fairness. We illustrate our methodology in two case studies, one situated in the criminal justice system and one in healthcare. Why distinguish between fair models and fair decisions? In the statistical literature, it is generally considered desirable to distinguish between inference (e.g., identifying a risk score model) and subsequent decision-making (e.g., deriving a classification from a risk score model): while the former represents a purely statistical task, the latter depends on subjective
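One simple way to quantify risk score fairness empirically is to compare calibration across groups; the sketch below computes a per-group binned expected calibration error (ECE). The paper develops more robust estimators, so this is only an assumed baseline illustration, and all names are hypothetical.

```python
# A minimal sketch of per-group calibration assessment for risk scores.
import numpy as np

def group_ece(scores, outcomes, groups, n_bins: int = 10):
    """Binned ECE of risk scores, computed separately for each group."""
    scores, outcomes, groups = map(np.asarray, (scores, outcomes, groups))
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    result = {}
    for g in np.unique(groups):
        s, y = scores[groups == g], outcomes[groups == g]
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (s >= lo) & (s < hi)
            if in_bin.any():
                # |mean predicted risk - observed rate|, weighted by bin mass.
                ece += in_bin.mean() * abs(s[in_bin].mean() - y[in_bin].mean())
        result[g] = ece
    return result

# Toy usage: group A's scores are calibrated; group B's true risk is s**2.
rng = np.random.default_rng(0)
s = rng.uniform(size=2000)
g = np.repeat(["A", "B"], 1000)
y = (rng.uniform(size=2000) < np.where(g == "A", s, s ** 2)).astype(int)
print(group_ece(s, y, g))  # group B shows markedly higher ECE
```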
- Europe > Denmark > Capital Region > Copenhagen (0.04)
- Europe > Germany (0.04)
- Europe > Spain > Catalonia (0.04)
- (8 more...)
- Research Report > Experimental Study (1.00)
- Research Report > New Finding (0.92)
- Law (1.00)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (1.00)
- Government > Regional Government (0.93)